multiple model
A Consistency-Centric Approach to Set-Based Optimization with Multiple Models of Unranked Fidelity
Morey, Danielle F., Pedrielli, Giulia, Wakayama, Cherry Y., Zabinsky, Zelda B.
In complex real-world settings, optimization is challenged by the presence of diverse models of differing fidelity. In many optimization problems, a single model is treated as the most accurate representation of the underlying system, while other models are evaluated primarily by their agreement with this presumed most accurate model. Yet in real-world applications, model accuracy is rarely known a priori, and assuming a single most accurate model can be misleading. This paper addresses this gap by proposing a flexible set-based optimization methodology called Set-Based Optimization with Multiple Models (S-BOMM) that works with multiple models without the assumption of a most accurate high-fidelity model. Unlike traditional optimization approaches that focus on finding an optimal solution according to the high-fidelity model, our methodology utilizes consistency between models to identify good solutions across multiple models. A probabilistic analysis of the consistency method is provided that bounds the likelihood of the methodology producing correct or incorrect results. Empirical results demonstrate the effectiveness of S-BOMM on test problems. By focusing on the consistency across models rather than relying on a single best solution, this set-based approach offers a practical alternative for optimization problems in which multiple models must be considered without assuming a single most accurate high-fidelity model.
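The idea of using consistency between models to identify a set of good solutions can be illustrated with a small sketch. This is not the S-BOMM algorithm from the paper, whose details the abstract does not give. It assumes, purely for illustration, a finite candidate list, a minimization objective, and a "good set" defined as each model's `k` best candidates; the names `good_set`, `consistent_solutions`, and the three toy models are hypothetical.

```python
from typing import Callable, Iterable, List, Set

Model = Callable[[float], float]


def good_set(model: Model, candidates: List[float], k: int) -> Set[float]:
    """Return the k candidates with the lowest objective value under one model."""
    ranked = sorted(candidates, key=model)
    return set(ranked[:k])


def consistent_solutions(
    models: Iterable[Model], candidates: Iterable[float], k: int
) -> Set[float]:
    """Keep only candidates that every model places in its own good set.

    No model is treated as the reference; all models are weighted equally.
    """
    candidate_list = list(candidates)
    per_model_sets = [good_set(m, candidate_list, k) for m in models]
    return set.intersection(*per_model_sets)


# Three toy models (assumed) that disagree on the exact optimum.
models = [
    lambda x: (x - 1.9) ** 2,
    lambda x: abs(x - 2.6),
    lambda x: (x - 1.5) ** 2 + 0.1 * x,
]
candidates = [0.0, 1.0, 2.0, 3.0, 4.0]

result = consistent_solutions(models, candidates, k=2)
print(result)
```

Here each model's top-2 set differs, and only the candidate that all three agree on survives the intersection. A larger `k` gives a larger, more permissive consistent set.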
Fantastic Features and Where to Find Them: A Probing Method to combine Features from Multiple Foundation Models
Ramtoula, Benjamin, Lajoie, Pierre-Yves, Newman, Paul, De Martini, Daniele
Foundation models (FMs) trained with different objectives and data learn diverse representations, making some more effective than others for specific downstream tasks. Existing adaptation strategies, such as parameter-efficient fine-tuning, focus on individual models and do not exploit the complementary strengths across models. Probing methods offer a promising alternative by extracting information from frozen models, but current techniques do not scale well with large feature sets and often rely on dataset-specific hyperparameter tuning. We propose Combined backBones (ComBo), a simple and scalable probing-based adapter that effectively integrates features from multiple models and layers. ComBo compresses activations from layers of one or more FMs into compact token-wise representations and processes them with a lightweight transformer for task-specific prediction. Crucially, ComBo does not require dataset-specific tuning or backpropagation through the backbone models. However, not all models are equally relevant for all tasks. To address this, we introduce a mechanism that leverages ComBo's joint multi-backbone probing to efficiently evaluate each backbone's task-relevance, enabling both practical model comparison and improved performance through selective adaptation. On the 19 tasks of the VTAB-1k benchmark, ComBo outperforms previous probing methods, matches or surpasses more expensive alternatives, such as distillation-based model merging, and enables efficient probing of tuned models. Our results demonstrate that ComBo offers a practical and general-purpose framework for combining diverse representations from multiple FMs.
High-Rate Mixout: Revisiting Mixout for Robust Domain Generalization
Aminbeidokhti, Masih, Medeiros, Heitor Rapela, Muralidharan, Srikanth, Granger, Eric, Pedersoli, Marco
Ensembling fine-tuned models initialized from powerful pre-trained weights is a common strategy to improve robustness under distribution shifts, but it comes with substantial computational costs due to the need to train and store multiple models. Dropout offers a lightweight alternative by simulating ensembles through random neuron deactivation; however, when applied to pre-trained models, it tends to over-regularize and disrupt critical representations necessary for generalization. In this work, we investigate Mixout, a stochastic regularization technique that provides an alternative to Dropout for domain generalization. Rather than deactivating neurons, Mixout mitigates overfitting by probabilistically swapping a subset of fine-tuned weights with their pre-trained counterparts during training, thereby maintaining a balance between adaptation and retention of prior knowledge. Our study reveals that achieving strong performance with Mixout on domain generalization benchmarks requires a notably high masking probability of 0.9 for ViTs and 0.8 for ResNets. While this may seem like a simple adjustment, it yields two key advantages for domain generalization: (1) higher masking rates more strongly penalize deviations from the pre-trained parameters, promoting better generalization to unseen domains; and (2) high-rate masking substantially reduces computational overhead, cutting gradient computation by up to 45% and gradient memory usage by up to 90%. Experiments across five domain generalization benchmarks, PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet, using ResNet and ViT architectures, show that our approach, High-rate Mixout, achieves out-of-domain accuracy comparable to ensemble-based methods while significantly reducing training costs.
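The weight-swapping step described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: weights are plain Python lists, the swap is applied independently per weight, and any rescaling used in the original Mixout formulation is omitted. The function name `mixout_swap` and the example values are hypothetical.

```python
import random
from typing import List


def mixout_swap(
    finetuned: List[float],
    pretrained: List[float],
    p: float,
    rng: random.Random,
) -> List[float]:
    """Replace each fine-tuned weight with its pre-trained value with probability p.

    Simplification (assumed): no rescaling of the mixed weights is performed.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if len(finetuned) != len(pretrained):
        raise ValueError("weight lists must have the same length")
    return [
        u if rng.random() < p else w  # u: pre-trained value, w: fine-tuned value
        for w, u in zip(finetuned, pretrained)
    ]


rng = random.Random(0)
pretrained = [0.10, -0.20, 0.30, 0.40]
finetuned = [0.15, -0.35, 0.05, 0.90]

# High masking rate, as in the abstract's 0.9 setting for ViTs.
mixed = mixout_swap(finetuned, pretrained, p=0.9, rng=rng)
print(mixed)
```

With a high `p`, most entries revert to their pre-trained values on any given step, so only a small fraction of the fine-tuned weights is active at once.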
On Arbitrary Predictions from Equally Valid Models
Lockfisch, Sarah, Schwethelm, Kristian, Menten, Martin, Braren, Rickmer, Rueckert, Daniel, Ziller, Alexander, Kaissis, Georgios
Model multiplicity refers to the existence of multiple machine learning models that describe the data equally well but may produce different predictions on individual samples. In medicine, these models can yield conflicting predictions for the same patient, a risk that is poorly understood and insufficiently addressed. In this study, we empirically analyze the extent, drivers, and ramifications of predictive multiplicity across diverse medical tasks and model architectures, and show that even small ensembles can mitigate or eliminate predictive multiplicity in practice. Our analysis reveals that (1) standard validation metrics fail to identify a uniquely optimal model and (2) a substantial number of predictions hinge on arbitrary choices made during model development. Using multiple models instead of a single model reveals instances where predictions differ across equally plausible models, highlighting patients who would receive arbitrary diagnoses if any single model were used. In contrast, (3) a small ensemble paired with an abstention strategy can effectively mitigate measurable predictive multiplicity in practice; predictions with high inter-model consensus may thus be amenable to automated classification. While accuracy is not a principled antidote to predictive multiplicity, we find that (4) higher accuracy achieved through increased model capacity reduces predictive multiplicity. Our findings underscore the clinical importance of accounting for model multiplicity and advocate for ensemble-based strategies to improve diagnostic reliability. In cases where models fail to reach sufficient consensus, we recommend deferring decisions to expert review.
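An ensemble paired with an abstention strategy can be sketched as a simple consensus check. This is a minimal illustration, not the procedure used in the study: the agreement measure (fraction of models voting for the most common label), the `min_agreement` threshold, and the function name `consensus_or_abstain` are all assumptions.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple


def consensus_or_abstain(
    predictions: Sequence[Hashable], min_agreement: float = 1.0
) -> Tuple[Optional[Hashable], float]:
    """Return (label, agreement) for one sample, or (None, agreement) to abstain.

    `predictions` holds one predicted label per ensemble member. The sample is
    classified only if the share of members voting for the most common label
    reaches `min_agreement`; otherwise it is deferred, e.g. to expert review.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    label, votes = Counter(predictions).most_common(1)[0]
    agreement = votes / len(predictions)
    if agreement >= min_agreement:
        return label, agreement
    return None, agreement


# Five equally plausible models agree: classify automatically.
unanimous = consensus_or_abstain(["benign"] * 5)

# The same five models split 3 to 2: abstain and defer the decision.
split = consensus_or_abstain(
    ["benign", "malignant", "benign", "benign", "malignant"],
    min_agreement=0.8,
)
print(unanimous, split)
```

The threshold trades coverage against arbitrariness: a stricter `min_agreement` defers more cases but leaves fewer automated predictions that depend on which model happened to be chosen.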